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[57] ABSTRACT 

A technique for performing joins in parallel on a multi- 
ple processor database system effectively deals with 
data skew. The join operation is performed in three 
stages with an optional fourth stage. The first stage is a 
preparatory stage, the detail of which depends on the 
underlying join algorithm used. This preparatory stage 
provides pre-processing the results of which are used in 
the following stage as the basis for defining subtasks for 
the Final join operation. The data provided in the first 
stage is used in the second stage to both define subtasks 
and to optimally allocate these subtasks to different 
processors in such a manner that the processors are 
close to equally loaded in the final join operation, even 
in the presence of data skew. This second stage is an 
assignment stage the details of which depend on the 
underlying join algorithm. Once the second stage has 
completed its processing of the subtasks, the subtasks 
are shipped to their assigned processors for processing 
and the final join of the two relations in the third stage. 
The method used in the final join operation depends on 
the underlying join algorithm used. Optionally, during 
the actual join as performed in the third stage, there 
could be a dynamic re-assignment of the subtasks should 
the join operation become unbalanced. 

14 Claims, 6 Drawing Sheets 
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JOINING TWO DATABASE RELATIONS ON A 
COMMON FIELD IN A PARALLEL RELATIONAL 
DATABASE FIELD 

5 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention generally relates to managing 
relational databases in a multi-processor environment 
and, more particularly, to joining two database relations 
on a common field in a parallel relational database envi- 
ronment in the presence of data skew by partitioning 
the join operation into separate jobs and optimally 
scheduling the jobs among a plurality of processors. 

2. Description of the Prior Art 

A common operation in relational database systems is 
the natural join of two relations on respective columns 
defined over a common domain. See, for example, the 
description of the natural join at pages 209 and 210 of 
An Introduction to Database Systems, Vol. 1, 3rd Ed., by 20 
C. Date, Addison-Wesley (1982). The result of the join 
is a new relation in which each row is the concatenation 
of two rows, one from each of the original relations, 
such that both rows have the same value in their respec- 
tive join columns. 25 

One popular algorithm for computing the join of two 
relations is the sort-merge technique as described by M. 
Blasgen and K. Eswaran in "Storage and Access in 
Relational Databases", IBM Systems Journal vol. 4, pp. 
363 et seq. (1977). It can be summarized briefly as fol- 30 
lows: First, each of the relations is sorted (if necessary) 
according to the join column. Second, the two sorted 
relations are scanned in the obvious interlocked se- 
quence and merged for rows which have equal values. 

When sort-merge joins are performed in parallel on a 35 
multiple processor database system, there exists a prob- 
lem of data skew that might exist in the join columns of 
the relations. In general, the issue of skew is not ad- 
dressed by the join algorithms described in the litera- 
ture. An early article on parallel sort -merge join is "Par- 40 
allel Algorithms for the Execution of Relational Data- 
base Operations" by D. Bitton, H. Boral, D. J. DeWitt, 
and W. K. Wilkinson, ACM Trans, on Database Systems, 
vol. 8, no. 3, Sep. 1983, pp. 324-353. Bitton et al. pro- 
pose two external parallel sort algorithms which they 45 
call parallel binary merge and block bitonic sort. In 
both algorithms, sorted runs are written to disk and 
two-way merges are used to merge runs from disk. The 
merge tree is mapped to different processors with the 
final merge being sequential. 50 

In "Join and Semijoin Algorithms for a Multiproces- 
sor Database Machine", ACM Trans, on Database Ma- 
chines, vol. 9, no. 1, Mar. 1984, pp. 133-161, P. Valdu- 
riez and G. Gardarin describe the algorithm generalized 
to a k-way merge. The result is p lists which are merged 55 
(assuming p<k) sequentially on a single processor. 

In the paper by J. P. Richardson, H. Lu and K. Mik- 
kilineni entitled "Design and Evaluation of Parallel 
Pipelined Join Algorithms", ACM SIGMOD 1987, San 
Francisco, May 1987, pp. 160-169, a method to parallel- 60 
ize the merge-join operation is described. In the method 
of Richardson et al., the relations Ti and T2 are merged 
to mi and 012 runs. Assuming that Tj is the larger rela- 
tion, each run of Ti is assigned to a processor. Each 
processor merges nu runs of T2 (i.e., the final merge of 65 
T2 is repeated, at least as many times as there are proces- 
sors) and merge-joined with runs of Ti assigned to that 
processor. This method is good when the projections of 



the two relations to be joined are such that one of them 
is very small. They also describe another version of 
their algorithm that is useful if one relation is small. 

S. G. Akl and N. Santoro in "Optimal Parallel Merg- 
ing and Sorting Without Memory Conflicts", IEEE 
Trans, on Comp., vol. C-36, no. 11, Nov. 1987, pp. 
1367-1369, consider merging two sorted lists in parallel 
by partitioning each of the two lists. 

Another popular algorithm for computing the join of 
two relations is the hash join technique described by D. 
J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. 

B. Kumar, and M. Maralikrishna in "Multiprocessor 
Hash-based Join Algorithms", Proc. Uth VLDB (1985). 
For a multi-processor system, it can be summarized 
briefly as follows: First, both relations are hashed (if 
necessary) into hash partitions according to. the join 
columns. The number of.hash partitions generally is set 
equal to the number of processors. Then the hash parti- 
tions are distributed among the processors, so that the 
corresponding partitions of the two relations reside on 
the same processor. Second, the corresponding hash 
partitions of the two relations are joined together. 

Although performance of a join query may be sped 
up by the use of multiple processors, the speed up from 
this kind of conventional join algorithm can be very 
limited in the presence of data skew as described by M. 
S. Lakshmi and P. S. Yu in "Effect of Skew on Join 
Performance in Parallel Architectures", Proc. Intl. Sym- 
posium on Databases in Parallel and Distributed Database 
Systems (1988). In "A Performance Evaluation of Four 
Parallel Join Algorithms in a Shared -Nothing Multipro- 
cessor Environment", Proc ACM Sigmod Conference 
(1989), D. A. Schneider and D. J. DeWitt evaluate the 
performance of four parallel join algorithms. They sug- 
gest that in the case of high data skew, an algorithm 
other than hash join should be considered. In ''Remov- 
ing Skew Effect in Join Operation on Parallel Proces- 
sors", Technical Report CSZ>-890027, UCLA (1989), R. 

C. Hu and R. R. Muntz propose a simple hash join 
algorithm to identify the biggest skew element and 
assign multiple processors to handle it. 

SUMMARY OF THE INVENTION 

It is therefore an object of the present invention to 
provide efficient techniques that minimize total execu- 
tion time in a parallel relational database environment 
for the natural join of two relations via the sort-merge 
and hash join techniques. 

It is another object of the invention to provide sort- 
merge and hash join techniques for operation on a mul- 
ti-processor database machine which effectively deal 
with any data skew that might exist in the join columns 
of the relations. 

According to the broad teachings of the invention, 
the join operation is performed in three stages with an 
optional fourth stage. The first stage is a preparatory 
stage, the detail of which depends on the underlying 
join algorithm used. This preparatory stage provides 
pre-processing, the results of which aire used in the 
following stage as the basis for defining subtasks for the 
final join operation. The data provided in the first stage 
is used in the second stage to both define subtasks and to 
optimally allocate these subtasks to different processors 
in such a manner that the processors are close to equally 
loaded in the final join operation, even in the presence 
of data skew. This second stage is an assignment stage 
which is crucial to the invention. The detail of the sec- 
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ond stage is different for different underlying join algo- FIG. 6 is a flow chart showing the details of the 

rithms; however, the general method of defining and second stage of the join operation using a first hash join 

assigning sub-tasks in the second stage is similar for the algorithm according to the invention; 

different underlying algorithms. Once the second stage FIG. 7 is a flow chart showing the details of the 

has completed definition and assignment of the subtasks, 5 second stage of the join operation using a second hash 

data is shipped among processors according to the as- join algorithm according to the invention; and 

signement for processing and the final join of the two FIGS. 8A-8B are diagrams illustrating how sub-tasks 

relations in the third stage. The method used in the final are re-allocated in the hash join operation of the inven- 

join operation depends on the underlying join algorithm tion. 

used Optionally during the actual join as performed in 10 DETAILED DESC RlPTION OF A PREFERRED 

the third stage, there could be a dynamic re-asagnment EMBODIMENT OF THE INVENTION 
of the subtasks should the join operation become unbal- . 

anced. In the following description, it is assumed for the sake 
P As mentioned, the join operation can be implemented of simplicity that each processor in the multiple proces- 
using different underlying join methods. In the first 15 sor database machine has the same computing power, 
aspect of this invention, we describe a method based on The exact nature of the parallel database architecture is 
sort-merge joins in parallel on a multiple processor otherwise largely immaterial. Referring now to the 
database system. In the first stage, the first and second drawings, and more particularly to FIG. 1, there is 
relations are partitioned into a number of sets corre- shown a block diagram of the. general organization of a 
sponding to the number of processors and sorted on the 20 multi-processor database machine. P processors 10| to 
column to be joined. This sorted data from the first 1G> are linked to each other through an interconnect 
stage is provided to the second stage. In the second network 12. The processors exchange data via the inter- 
stage, the sorted data from the first stage are re-parti- connect network 12. The particular method used for 
tioned into ranges and multiplicities to define subtasks interconnecting the various processors is not important 
each of which will join data from one range partition. 25 as long as it has sufficient bandwidth for the tasks at 
The time it will take a single processor to perform each hand. The database machine also uses conventional disk 
of the subtasks is estimated, and the partitions are fur- drives 16ix>i to 16^ for secondary storage. Relations 
ther divided as necessary to balance sort operations to be sorted or joined are stored on these disk drives, 
among the several processors. Jobs are scheduled The distribution of values taken by the join attribute 
among the processors according to a minimum makes- 30 has a significant impact on the execution time of the join 
pan optimization technique. The algorithm deals effec- operation. In many databases, certain values for an 
tively with any data skew that might exist in the join attribute occur more frequently than other values, re- 
columns of the relations by updating the scheduling of suiting in a. skewed distribution. Consider the three 
jobs among the processors based on estimated skew. relations shown in FIG. 2 of CUSTOMERS, ORDERS 

In the second aspect of the invention, we describe a 35 and SUPPLIERS. The Customer ID on the CUSTOM- 
method based on the hash join algorithm in parallel on ERS relation takes unique values, but the same attribute 
a multiple processor database system. The technique is on the ORDERS relation can take non-unique values, 
specifically designed to handle data skew. The pro- since a customer may order more than one item. In 
posed algorithm is based on a hierarchical hashing con- addition, a given item can be ordered by more than one 
cept. Hierarchical hashing is combined with a minimum 40 customer, and several suppliers can supply a given item, 
makespan optimization algorithm to iteratively split up A query which needs to list all the customers with 
the hash partitions and make equitable hash partition outstanding orders would join the ORDERS and CUS- 
assignments in the multiple processors. As part of the TOMERS relations on the Customer ID column, result- 
analysis, the partitioning is refined when the size of a ing in a join with a single skew. A query which needs to 
partition causes the load balancing to be less than desir- 45 list all the suppliers who can fill the outstanding orders 
able. During this evaluative stage, actual hashing is not would join the ORDERS and SUPPLIERS relations 
performed. Instead, the benefits of hashing at each level on the hem column, resulting in a join with a double 
are estimated, so that the optimal hashing can be per- skew. The terms single skew and double skew are used 
formed later. to represent the cases where the join attribute of one or 
„,^„, „ ^ a „ ffV ,^r 50 both the relations have skewed distribution. 

BRIEF DESCRIPTION OF THE DRAWINGS ^ join opcration according t0 ^ invention, as 

The foregoing and other objects, aspects and advan- illustrated in FIG. 3, has three stages and, optionally, a 

tages will be better understood from the following de- fourth stage. Stage 1 is the preparatory stage which is 

tailed description of a preferred embodiment of the common to all aspects of the invention; however, the 

invention with reference to the drawings, in which: 55 detail of this stage varies depending on the underlying 

FIG. 1 is a block diagram of the general organization join algorithm used. This stage is described in more 

of a multi-processor database machine on which the join detail for two underlying join algorithms. More specifi- 

procedures according to the invention may be prac- cally, we first describe Stage 1 for a sort-merge based 

ticed; embodiment of the invention, and then we describe 

FIG. 2 is an example of three relations used to illus- 60 Stage 1 for a hash based embodiment of the invention, 

trate the concept of data skew; The Stage 1 preparatory phase provides preprocess- 

FIG. 3 is a high level block diagram illustrating the ing, and its results are used in Stage 2 as the basis for 

stages of the join operation according to the invention; defining subtasks for the final join operation of Stage 3. 

FIG. 4 is a flow chart showing the details of the More specifically, the data provided by Stage 1 is used 

second stage of the join operation using a sort-merge 65 to both define subtasks and to optimally allocate these 

join algorithm according to the invention; subtasks to different processors in such a manner that 

FIG. 5 is a diagram illustrating how sub-tasks are split the processors are close to equally loaded in the final 

into smaller sub-tasks in a first aspect of the invention; join operation in Stage 3, even in the presence of data 
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skew. Thus, Stage 2 is an assignment stage, and it is V=Vj = V2, we may wish to consider the additional 

crucial to this invention. In the following description, possibility of partitioning the larger of the two sets 
we describe in detail the steps involved in Stage 2 for 
both the son-merge and the two hash based embodi- 
ments of this invention. 5 

Once Stage 2 has completed its definition and assign- UPW* ™* £PW- 
ment of the subtasks, the partitions of the relations to be 

joined are shipped to their assigned processors for pro- ^ evenly as ^ imo M sels of rows eacht where 

cessing and the final join of the two relations in Stage 3. s M ^P, and creating still finer independent subtasks 

The method used in the ^J°^^ on ^^^ 10 r i n n% . . . , In subtask r-y^me{h . . . , M}, 

the underlying join method. Optionally, Stage 4 in- * ■» amJ 

volves processors reporting their progress. Since the w,c w ™ j« wuw» r 

a^entt in Stage Kre based on estimates of subtask *» of the larger set is performed on a single processor, 

times, the join operation commenced in Stage 3 could Assuming, for example, that the first relation contains 

become unbalanced. If the unbalance exceeds a certain 15 the greater number of rows, then the time it takes to 

predefined threshold, the assignment of Stage 2 can be perform this subtask is T™Ki.K2=A(I*i + l2)+BI m |l2, 

optionally altered dynamically. where I^i is the number of rows in the mth partition. 

The first aspect of the invention to be described re- We do not insist that each of the M subtasks be per- 

lates to an underlying join method based on the sort- formed on a different processor, although in practice 

merge join algorithm. Suppose that relation Ri contains 20 this is likely to be the case. Clearly, performing M> 1 

Nj rows while relation R2 contains N2 rows. Let P subtasks is less efficient than performing one, since the 

denote the number of processors in the database system. j nput f rom t h e smaller relation (in this case R2) must be 

Inth ejirst Stage of the invention, the set of rows of shipped to each of the processors involved. We only 

Ri aTeT>artitione d as evenly as possible into E_sets.of make ^ of this appr0 ach to handle excessive skew. We 

approxi mately flj/P ™ws <gcji. In other words,, the 25 will ^ y that lhe type 2 pair (V,,V 2 ) has multiplicity M. 

parthtOn will have Ni^FlNj/Pj KtfoHNi/P] rows pair (V,,V 2 ) will be said to have multiplicity 1. 

each, and P-N^PLIia/JJ sets of [N,/^ rows each, ^ P can s ^ ate 2 0ur general approach: Suppose we 

where W denotes the smallest integer greater than or * f ™ J*, 

eaual to the real number x and Ixj denotes the largest 7. ,™,r . . , r , e Al _ 

fnteger less than or equal to x. £ch processor is then 30 corresponding multiplicities m the domain of the join 

given one of the partitions to sort. The processors columns. This sequence will have the form 
should complete their tasks in more or less equal time. 

The second relation is sorted analogously. At the end of ^ ; <3< v 4 ,av u <v* + i jSV fc+ i. 

this stage there are two sets of P sorted "runs". 2< mm \ <v KA *v K j 

To introduce the second stage of the invention, sup- 35 

pose that Vi^V 2 are two values in the domain of the £ach va]ue in the j oin C0 Iumns of Ri and R 2 is required 

join columns. Given any of the 2P sorted runs, for ex- tQ faJ1 whhin one of the meTV a\s [V W .V«J. For kc{l, 

ample the one corresponding to relation ic{l,2> and x}, let Ma- denote the multiplicity of the pair 

processor je{l P}, there is a well-defined (possibly y } We have created 

empty) contiguous subset irij,v\,vi consisting of all rows 40 
with son column values in the interval [Vj.Vj]. Ship- 
ping each of the tt^vx.vi over to a single processor for * = 2 M k 
final sorting and merging results in an independent sub- '=1 
task T ] y\ t v2 of the total remaining part of the join opera- 
tion. The superscript here underscores the fact that a 45 subtasks T m y^uVk,2^o be done at the P processors. The 
single processor is involved. The significance of this total computing time involved can be estimated as 
will become apparent in the following description. Let 
us estimate the time it takes to perform this subtask as Mk 
TVi,^=A(I,+I 2 )+BO, where 2 2 «iti.wu.. 

R B 1 HI — 1 

50 

//= 2 caniOtyKi.Fa) which we wish to distribute as evenly as possible among 

J= ] the processors. A "perfect" assignment, not necessarily 

_ _ . , possible, would have each processor busy for 

is the number of rows of mput from R„ O is the number 
of rows of output from the merge of Ri and R2, and A 55 

and B are known constants. If we make the assumption K ^ /p 

that the values of each ir/y.j/j.^are uniformly distributed k=im-i Vk - 1 ,KU 

over the Dv\,n elements in the underlying domain 

between Vi and V 2> then we can compute 0=1- units of time. Specifically, we would like to assign each 
ihlDn.n- 60 subtask r m vk t \,Vka to a processor AJj m vk.\ t Vk^) in such 

In the special case where Vi < V2, the computation of a way that the completion time of the total job, 
O is merely an estimate. We shall call a pair (Vi,V 2 ) 
satisfying Vi<V 2 a type 1 pair. In the special case 

where Vi = V2, the equation for O reduces to 0=Iil2- M * 
In fact, the output is just the cross-product of the two 65 m K j ^ 

inputs, so that the formula is precisely accurate in this ig/>sM=i m m ~ n.i.vt* 

case. We shall call a pair (V!,V 2 ) satisfying Vj=V 2 a w*.i.hj> - P 

type 2 pair. Actually, for type 2 pairs, say with 
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is minimized. This optimization problem is the so-called part corresponding to one of two relations is empty, and 

minimum makespan or multiprocessor scheduling prob- the join will yield nothing. 

lem. Although it is known to be NP-complete, there At this point, a test is made in decision block 34 to 

exist a number of very fast heuristics which have rea- determine if the list of subtasks empty. If so. the process 

sonably good worst-case performance and excellent 5 stops; otherwise, the list of subtasks is renumbered and 

average-case performance. Specifically, see R. Graham, reordered in function block 36. Control then loops back 

"Bounds on Multiprocessing Timing Anomalies", to decision block 22. 

SIAM Journal of Computing, vol. 17, 1969, pp. 416 et If the test in decision block 22 is negative, then a 

seq., for a description of the LPT heuristic, and E. CofT- further test is made in decision block 38 to determine if 

man, M. Garey and D. S. Johnson, "An application of 10 the largest element in the list of subtasks is of type 1. If 

Bin Packing to Multiprocessor Scheduling", SIAM so, control goes to function block 32; otherwise, a test is 

Journal of Computing, vol. 7, 1978, pp. 1 et seq., for a made in decision block 40 to determine if it is possible to 

description of the MULTIFIT heuristic. increase the multiplicity of type 2 pair corresponding to 

Of course, we have control over how the ordered the largest subtask in the list to a new multiplicity M<P 
sequence of pairs of values and corresponding multi- 15 in such a way that each subtask now fits? If so, the 
plicities is created. The goal in the second stage of our smallest such multiplicity is found and the list of sub- 
invention is a divide-and-conquer approach to creating tasks is revised in function block 42 to incorporate this 
this ordered sequence. An example of one way to pro- new multiplicity. Additionally, K is adjusted and the 
ceed is the algorithm illustrated in the flowchart of time estimates for the new subtasks are computed before 
pjQ 4 w 20 control goes to function block 36. 

With reference now to FIG. 4, the process begins in If the test in decision block 40 is negative, then in 
function block 20 by initially setting K = 1, V u to be the function block 44 the subtask is assigned to have multi- 
minimum value in the join columns of Ri and R 2 , V u to plicity P, each of the P subtasks is assigned to distinct 
be the maximum, and Mi = 1. In addition, the time for processors, and the subtasks are removed from the list, 
the subtask is estimated, and a list of all current subtasks 25 In addition, K is adjusted. The P subtasks removed will 
is maintained, ordered by subtask time estimates. Ini- compleie in more or less equal time. Control now goes 
tially, there is only one element on the list. Next, a test to decision block 34. 

is made in decision block 22 to determine if the time Our approach favors multiplicity 1 pairs, but allows 

estimate of the largest subtask in the list is less than or for higher multiplicities to overcome excessive skew, 

equal to 1/Pth of the sum of the time estimates of all 30 The locations of these excessive skews should naturally 

subtasks in the list. In other words, does the largest present themselves during the execution of the algo- 

subtask "fit"? If so, the LPT or MULTIFIT algorithms rithm. The subtask time estimates should similarly be- 

are performed in function block 24. The first and most come more and more accurate as the algorithm exe- 

computationally costly step in either algorithm is to cutes. There are many variations on the above theme 

order the subtasks by time estimates. This is already 35 which could also be implemented. For example, LPT or 

done in our case. A test is then made in decision block MULTIFIT need not be performed every time through 

26 to determine if the makespan is within a predeter- function block 24. A counter could regulate the number 

mined multiplicative factor 1+ of a perfect assignment. of executions of the minimum makespan algorithm. 

If so, the process stops; otherwise, a test is made in Similarly, the algorithm could time out according to 

decision block 28 to determine if there are any type 1 40 another counter, even if the solution is not fully satisfac- 

pairs. If not, the process stops; otherwise, the largest tory. 

type 1 pair in the list is selected in decision block 30. In Stage 3 shown in FIG. 3, the sorted tuples of rela- 

Then in function block 32, the largest type 1 pair tions K\ and R 2 are read from disk, and the tuples corre- 

(Vju t V*,2) is split into two to three parts as follows: In sponding to the subtasks are shipped to the assigned 

Rj there are Ii total elements, consisting of P sorted 45 processor. The subtasks are then executed on the as- 

rows TT)j,vui^ In R2 there are I2 total elements, consist- signed processors to perform the final join operation, 

ing of P sorted rows wy.v\,n- All told, we have 2P Optionally, as indicated by Stage 4 in FIG. 3, during 

sorted rows with a total of I1+I2 elements. An algo- the course of the actual join, the processors could re- 

rithm due to Z. Galil and N. Megiddo, "A Fast Selec- port their progress. Since the subtask time estimates are 

tion Algorithm and the Problem of Optimum Distribu- 50 just that, the progress of the join could be come unbal- 

tion of Effort", Journal of the ACM, vol. 26, 1979, pp. 58 anced. If the unbalance exceeds a certain predefined 

et seq., finds the ([1 1 + I 2 j )/2th largest element u. of this threshold, a new LPT or MULTIFIT algorithm could 

set. This is a special case of the so-called selection prob- be initiated. 

lem. In fact, the algorithm divides each set itypi.n into The invention can also be practiced using a method 

three contiguous (possibly empty) regions; ir l \j,v\ t V2, 55 based on the hash-join algorithm and using adoj ihle 

consisting of rows less than u,, ir 2 y,^.^, consisting of hashing. te c hnique implementing^ hierarchical_h ashing 

rows equal to ji, and ir l \j,v\ t Vi* consisting of rows in Stages 1 and 2 of FIG. 3 . Other variations of the 

greater than ft. Thus, we have created three subtasks hierarchical hashing technique are discussed later. The 

where there had been one. The first or third subtask, but double hashing is adopted to help identify the skew 

not both might be empty. Either one could be of type 1 60 values, provide better estimates of the join costs at each 

or type 2. The second subtask will not be empty, and processor, and ultimately make the load balancing more 

will be of type 2, with multiplicity 1. FIG. 5 illustrates even. 

how the three new sub-tasks are created from one old In the first stage of FIG. 3, we start with each proces- 

sub-task using this method. Next, K is adjusted, the sor having close to equal partitions of Ri of size approxi- 

intervals are determined, and the time estimates are 65 mately Nj/P rows each. In other, words, the partition 

computed for each of the new subtasks. If the number of will have Nj-PINj/P] sets of IN]/Pl rows each, and 

rows of output is computed to be zero for any of the P-Ni+P[Ni/PJ sets of [Ni/Pj rows each. Let Hi 

subtasks, the subtask can be deleted from the list; the and H2 be two Eash functions, where H2 is used to fur- 
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ther divide each hash partition created by Hi into finer different processor, although in practice this is likely to 

partitions. Assume Hi can hash the rows into Bi hash be the case. Clearly, performing M>1 subtasks is less 

partitions and H2 can refine each of these partitions into efficient than performing one, since the input from the 

B 2 finer hash partitions. For each processor), let smaller relation (in this case R2) must be shipped to each 

d/;t m (Ri) be the number of rows whose hash value of its 5 of the processors involved. We only make use of this 

join column under H] falls into the kth hash partition of approach to handle excessive skew. We will say that the 

Hi and whose hash value under H2 falls into the roth composite hash partition (k ? {m}) has multiplicity M. 

hash partition of Hj. We refer to the row as having a Otherwise it will be said to have multiplicity 1. 

composite hash value (k,m). Each processor reads in Now we can state our general approach: Suppose we 

tuples ofRi from disk and maintains counts d/jt, m (Rl) for 10 create an ordered sequence of K pairs of the form 

k= 1, . . . , Bj and m= 1, . . . , B 2 in the main memory of (V^i.Vu) where V*,i«.{l. . . . , Bj) and V*,2*{1, 

each processor j to keep track of how many tuples will Bl y In tne w here Vk,2 contains just a single ele- 

be hashed into the fine-grained partitions based on H| mem% the (y k h V k 2 ) pair is referred to as a type 2 pair, 

and H 2 . The processors should complete their tasks in otherwise, it "is referred to as a type 1 pair. The se- 

more or less equal time The second relation is hashed 15 quence G f K pairs is required to partition the set of 

analogously, yielding d/*, m (R 2 ); At the end of this stage, con ,posite hash classes in the sense that for each com- 

we have two sets of counts, d/jt, m (R/),i = 1.2. posite hash class (ij), there must exist a unique Kc{l, . 

At the start of the second stage of FIG. 3, the K} such that J ^ ^ Vk 2 Fo r kc{l, . . . , K}, 

d/A. m (R,) are added across the processors to obtain j e t Mjtdenote the multiplicity of the pair (V*,i,V*, 2 ). We 

20 have created subtasks T m yu,Vk,2 t0 done at the P 

„ ,„ v — . i /d\ processors. The total computing time involved can be 

^Rfi-^j^™ estimated as 

Given any of the 2PBj hash partitions from Hi, for ^ 

example the one corresponding to relation i€{l,2}, pro- 25 1 2 ^ ^ 

cesser j£{l P}» and hash partition k*{l Bi>, * =,mB ' 

and eiven a subset E<{1, .... B2}, there is a well-de- 

fined (possibly empty) subset ir^ consisting of all which we wish to distribute as evenly as possible among 
rows with-composite hash values (k,m), mcE. Shipping the processors. A "perfect" assignment, not necessarily 
each of the Tr^fover to a single processor for final join 30 possible, would have each processor busy for 
results in an independent subtask T l k,E of the total re- 
maining part of the join operation. The superscript here K M k 
underscores the fact that a single processor is involved. ^1 j ^ ^uvu^ 
The significance of this will become apparent shortly 

The time it takes to perform this subtask is estimated as 35 ^ ^ ^ Specifically we wouId , ike to assign eaC h 

Vk.E= Afli + I2HBO, where subtask T™vk t \.Vk,2 to a processor A(r m n A ,v k .2) in such 



a way that the completion time of the total job, 



2 0A. m (J?/) 



mtE 



40 



is the number of rows of input from R/.O is the number ^ K ^ * 

of rows of output from the merge of Ri and R2, and A i£^/>*-i „ 

and B are known constants. If it is assumed that the ^ T >'*.U'u> - p 

values of each irij.k, m . meE, are uniformly distributed . 

over the Djt, w elements in the underlying domain with is minimized. This optimization problem is the so-called 

composite hash value (k,m), then we can compute minimum makespan or multiprocessor scheduling prob- 
lem. Although it is known to be NP-complete, there 
exist a number of very fast heuristics which have rea- 

o = J, £ $k.m(R\)Pk.nAR2)/nk.m' 5Q sonably good worst<ase performance and excellent 

average-case performance. 

For the case E contains a single value, say m. we may The point is that we can exercise control over how 

wish to consider the additional possibility of partition- the ordered sequence of pairs of values and correspond- 

tag the larger of the two sets »g multiplicities is created. The goal in the second 

* 6 55 stage of our invention is a divide-and-conquer approach 

to creating thi s ordered sequence . We present two alter- 

P P native examples of possible ways to proceed in Stage 2. 

V pijXM and V pijXM The f irst method uses a variant of the LPT heuristic 

1 J ~ 1 by Graham and is illustrated in the flowchart of FIG. 6. 

as evenly as possible into M sets of rows each, where 60 Referring now to FIG. 6, initially hvfunction block 50, 

I5MSP, and creating still finer independent subtasks the sets {<k,F),k=l, . . . , Bi>, where F-U, . • • . 

rilrr r"7/ , In subtask rhi m \ HI, . . . , M}, the are sorted in descending order according to the subtask 

Jo^f the £JfcK? anSe ^partition of the value. T> A list of all current subtasks is maintained, 

larger set is performed on a single processor. Assuming. ordered by subttsk time estimates. A test is then made in 

for example, that the first relation contains the greater 65 decision block 52 to determine ifthe time esimae of the 

number of rows, then the time it takes to perform this largest subtask in the list is less than or equal to 1/Pth of 

subtask is T4im>=A((I|/M)+h)+BIil2/M. We do the sum of the time estimates of all subtasks in the list. In 

not insist that each of the M subtasks be performed on a other words, does the largest subtask "fit"? If not. a test 
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is made in decision block 54 to determine if the largest 
element in the list of subtasks of type 1. If so, then in 
function block 56 the largest type 1 pair (V/u.Vju) is 
split into card (V*. 2 ) type 2 pairs, each of multiplicity 1. 
Next, the list of subtasks are renumbered and the list of 
subtasks is reordered in function block 58 before control 
loops back to decision block 52. 

If the test in decision block 54 is negative, a further 
test is made in decision block 60 to determine if it possi- 
ble to increase the multiplicity of the type 2 pair corre- 
sponding to the largest subtask in the list to a new multi- 
plicity M<P in such a way that each subtask now fits. 
If so, the smallest such multiplicity is found in function 
block 62, and the list of subtasks is revised to incorpo- 
rate this new multiplicity. Additionally, K is adjusted 
and the time estimates are computed for the new sub- 
tasks. The process then goes to function block 58. 

If the test in decision block 60 is negative, then in 
function block 64 the subtask is assigned to have multi- 
plicity P, each of the P subtasks is assigned to distinct 20 
processors, and the subtasks are removed from the list. 
K is also adjusted. The P subtasks removed will com- 
plete in more or less equal time. A test is next made in 
decision block 66 to determine if the list of subtasks 
empty. If so, the process stops; otherwise, the process 25 
loops back to function block 58. 

When the test in decision block 52 is positive, all 
subtasks fit. In function block 68, the target makespan 
time TARGET is computed as 1/Pth of the sum of the 
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other words, does the largest subtask 'Tit'*? If so, LPT 
or MULTIFIT is performed in function block 94. 
Again, the first and most computationally costly step in 
either algorithm is to order the subtasks by time esti- 
mates. This is already done in our case. A test is next 
made in decision block 96 to determine if the makespan 
within a predetermined multiplicative factor 1 + A of a 
perfect assignment. If so, the process stops; otherwise, a 
test is made in decision block 98 to determine if any type 
1 pair remains. If not, the process stops; otherwise, the 
largest type 1 pair in the list is selected in function block 
100. Next, in function block 102, the largest type 1 pair 
(V*,i,Vju) is split into two pairs, as follows: For each 
element mcVjt 2, there is a corresponding subtask time 
Vy k ,uim}- Perform LPT or MULTIFIT on the set 
{rVkj'{m}\m€V kt2 }, splitting V*, 2 into two subsets VU.2 
and V*ka of approximately equal total subtask time 
estimates. If any of the new subtasks is estimated to be 
of time 0, it can be deleted from the list. In addition, K 
is adjusted. Alternatively, one could split the largest 
type 1 pair into any number Q of groupings between 2 
and the cardinality of V*,2. The choice Q-2 is chosen 
for exposition. A test is now made in decision block 104 
to determine if the list of subtasks empty. If so, the 
process stops; otherwise, the list of subtasks is renum- 
bered and reordered in function block 106 before con- 
trol loops back to decision block 92. 

If the test in decision block 92 is negative, a test is 
made in decision block 108 to determine if the largest 



time estimates of all subtasks in the list. The key point is 30 element in the list of subtasks is of type 1? If so, control 



that this target is now stable. Although we may split 
type 1 pairs into type 2 pairs in subsequent steps, the sum 
of the time estimates will remain unchanged. Our LPT 
variant takes place in the following steps. 

Consider the processor p for which the difference 35 
between TARGET and the sum of the subtask times 
already assigned to it is maximal. Consider the largest 
subtask (Vjtj,Vjfc t2 ) remaining on the list. A test is made 
in block 70 to determine if it is a type 2 subtask. If so in 
function block 72, subtask (V*,i,V*. 2 ) is assigned to 40 
processor p, and the subtask is removed from the list. 
Then, in decision block 80, a test is made to determine 
if the list of subtasks is empty. If so, the process ends; 
otherwise, the process loops back to decision block 70. 

If the decision in decision block 70 is negative, a test 45 
is made in decision block 78 to determine whether if 
(Vju.Vfcj) were assigned to processor p, would the 
revised difference between TARGET and the sum of 
the subtask times assigned to it be negative. If not, con- 
trol goes to function block 72; otherwise, in function 50 

block 74, (V*j ( V*,2) is s P ht int0 card (Vte) W e 2 P*"*' 
each of multiplicity 1. The list of subtasks is renumbered 
and reordered in function block 76 before control re- 
turns to decision block 70. 

The second hash join algorithm for Stage 2 of the join 
operation is illustrated by the flowchart in FIG. 7. As 
may be observed by comparing the flowcharts of FIGS, 
4 and 7, the overall logic is quite similar between the 
sort-merge join algorithm and this hash join algorithm. 
With reference now to FIG. 7, the process begins in 
function block 90 by initially taking the set {(k,F),k= 1, 
. . . , Bi}, where F={l, . . . , B2}, and sort them in 
descen ding order a ccording to the subtask value, T 1 *^. 
We wiirmaintain a list of all current subtasks, ordered 
by subtask time estimates. A test is then made in deci- 
sion block 92 to determine if the time estimate of the 
largest subtask in the list is less than or equal to 1/Pth of 
the sum of the time estimates of all subtasks in the list. In 



55 



60 



65 



goes to function block 102; otherwise, a test is made in 
decision block 110 to determine if it is possible to in- 
crease the multiplicity of the type 2 pair corresponding 
to the largest subtask in the list to a new multiplicity 
M < P in such a way that each subtask now fits. If so, the 
smallest such multiplicity is found and the list of sub- 
tasks is revised to incorporate this new multiplicity in 
function block 112. K is adjusted, and the time estimates 
are computed for the new subtasks. Control then goes 
to function block 106. If the test in decision block 110 is 
negative, then in function block 114, the subtask is as- 
signed to have multiplicity P, each of the P subtasks is 
assigned to distinct processors, and the subtasks are 
removed from the list. In addition, K is adjusted. The P 
subtasks removed will complete in more or less equal 
time. The process then goes to function block 104. 

Our approach favors multiplicity 1 pairs but allows 
for higher multiplicities to overcome excessive skew. 
The locations of these excessive skews should naturally 
present themselves during the execution of the algo- 
rithm. There are many variations on the above methods 
which could also be implemented. For example, in ei- 
ther method an iterative improvement algorithm could 
be applied at the end. This method is illustrated in 
FIGS. 8A-8B. In this figure, the GOAL represents 
1/Pth of the sum of the time estimates of all subtasks. 
The method picks the processor P with the largest sum 
of the time estimates of subtasks assigned to it. At pro- 
cessor P, the largest type 1 subtask (Vaj.Vju) is picked 
and is split into card (Vju) type 2 each of multiplicity 1. 
These tasks are then re-assigned to the processors, the 
largest subtask first, as illustrated in FIGS. 8A-8B. 

As in Stage 2 for the sort-merge join algorithm illus- 
trated in FIG. 4, in this method for Stage 2 based on the 
second hash join algorithm as illustrated in FIG. 7, LPT 
or MULTIFIT need not be performed every time 
through function block 94. A counter could regulate the 
number of executions of the minimum makespan algo- 
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rithra. Similarly, the algorithm could time out accord- execution time estimates divided by the number of pro- 

ing to another counter, even if the solution is not fully cessors. Thus, replacement jobs are defined where each 

satisfactory. replacement job consists of the task of joining on of the 

In Stage 3 shown in FIG. 3, the tuples of relations Rj smaller partitions of the one relation with the un- 

-and R 2 are read from disk, the hash functions Hi and H 2 5 changed corresponding partition of the other relation, 

are applied, and the tuples shipped to the assigned pro- Once the process of subpanitioning has been com- 

cessors as determined in Stage 2. pleted, the current set of execution time estimates are 

Optionally in Stage 4 shown in FIG. 3, during the updated by replacing the execution time estimates of the 

course of the actual join, the processors could report replaced jobs with execution time estimates for the 

their progress. Since the subtask time estimates are just 10 replacement jobs. The process is iteratively performed 

that, the progress of the join could become unbalanced. until there is no execution time estimate for any job in 

If the unbalance exceeds a certain predefined threshold, the current set of jobs which is greater than the sum of 

a new LPT or MULTIFIT algorithm could be initiated. the execution time estimates divided by the number of 

We note that there are many alternative ways to processors, 
implement the hierarchical hashing concept. One alter- 15 An important part of the invention involves the re- 
native is to keep a most-frequently-used list on the dif- finement of the processor assignments. This is done by 
ferent join column values encountered during Stage 1 scheduling the current set of jobs among the processors 
when the hashing is performed. The occurrence fre- using a minimum makespan optimization technique, 
quency would also be maintained. If the occurrence rate The amount of completion time skew which would 
of a join column value drops below a predefined thresh- 20 result if the processors were to perform the current set 
old. it would be removed from the list. Another alterna- of jobs as scheduling is estimated. This estimate is com- 
tive is to do sampling of join column values during pared with a skew standard, and if the estimate is ac- 
Stage 1 when the hashing is performed. The most-fre- ceptable, then the cun-ent set of jobs as scheduled is 
quently-used list or sampling results could then be used executed on the processors, thereby joining the two 
to identify the skew distribution and guide which parti- 25 database relations on a common field with minimum 
tions are to be further partitioned. ^completion time skew. 

In summary, we have provided a method of joining If the estimated skew does not meet the skew stan- 
two database relations on a common field in a parallel dard, then one of the l argest parti tions of the first and 
relational database environment having a plurality of secgndTrejations j s replaced with at least two smaller 
processors based on using either sort-merge or hash join 30 parti tions to^fin e^pjaceme^ the job defined 
algorithms. The process involves partitioning a first by the replaced^partitions. Each replacement job con- 
relation into a multiplicity of partitions and a second sists of "the task of joi ning ojie of the smaller repl acing 
relation into a corresponding multiplicity of partitions partitionsoT the first relation with the cor responding 
such that each individual value in the common field smaller repla cing p artition jfrlfte secono jeiation. The 
corresponds uniquely to the corresponding one of the 35 currenTseTof execution time estimates is tKen^updated 
partitions of the second relation, thereby defining a by replacing the execution time estimates of the re- 
current set of jobs, each consisting of the task of joining placed jobs with execution time estimates for the re- 
a partition of the first relation with the corresponding placement jobs. In addition, the scheduling of the cur- 
partition of the second relation. The time it will take for rent set of jobs is updated so as to minimize the esti- 
a single processor to accomplish each of the jobs is 40 mated completion time skew. The amount of comple- 
estimated, these estimates forming a current set of exe- tion time skew is estimated and compared with the skew 
cution time estimates. " standard. This process is iteratively repeated until either 

Next, it is necessary to reduce the jobs to a time less the amount of skew meets the skew standard or all 

than or equal to the sum of the time estimates divided by partitions have been subpartitioned to a predetermined 

the number of processors. This is done by subpartition- 45 maximum extent. 

ing any partitions which correspond to jobs having an While the invention has been described in terms of 

execution time estimate which is greater than the sum of preferred embodiments based on sort-merge and hash 

the time estimates divided by the number of processors. join algorithms, those skilled in the art will recognize 

The sub partition ing is accomplished by replacing any that the invention can be practiced with modifications 

such partition oftEe first relation and the corresponding 50 within the spirit and scope of the appended claims, 

partition of the second relation which include more Having thus described our invention, what we claim 

than one distinct value in the common field with at least as new and desire to secure by Letters Patent is as fol- 

two smaller partitions, each individual value in the com- lows: 

mon field corresponding uniquely to one of the smaller 1. In a multi-processor database machine having P 
partitions of the first relation and uniquely to the corre- 55 parallel processors linked via an interconnect network, 
sponding one of the smaller partitions of the second said P processors having storage means for storing data- 
relation. This defines replacement jobs in the current set base relations to be joined, a method performed by said 
of jobs where each replacement job consists of the task database machine of joining first and second database 
of joining one of such smaller partitions of the first relations stored on said storage means on a common 
relation with the corresponding smaller partition of the 60 field, said method comprising the steps of: 
second relation. With respect to any such job where the (a) partitioning the first relation into a multiplicity of 
corresponding partitions of the first and second rela- partitions and the second relation into a corre- 
tions include only one distinct value in the common sponding multiplicity of partitions such that each 
field, the larger of such corresponding partitions is re- individual value in the common field corresponds 
placed with a plurality of smaller partitions, the number 65 uniquely to one of the partitions of the first relation 
of such smaller partitions being the minimum integer and uniquely to the corresponding one of the parti- 
which when divided into the execution time estimate tions of the second relation to define a current set 
for such job results in a time less than the sum of the of jobs, each consisting of a task of joining a parti- 
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tion of the first relation with a corresponding parti- 
tion of the second relation; 

(b) scheduling a current set of jobs among said P 
processors using a minimum makespan optimiza- 
tion technique that schedules said jobs such that a 5 
maximum running time on any processor is mini- 
mized; 

(c) estimating the amount of completion time skew 
which would result if P processors were to per- 
form the current set of jobs as scheduled in step (b); 10 

(d) comparing the estimated amount of skew with a 
skew standard; and 

(e) if the estimated skew meets the skew standard, 
executing the current set of jobs as last scheduled 
using P processors, thereby joining said two data- 15 
base relations on a common field with minimum 
completion time skew. 

2. The method recited in claim 1 wherein said parti- 
tioning step results in larger and smaller partitions of the 
first and second relations and if the estimated skew does 20 
not meet the skew standard, further comprising the 
steps of: 

(0 replacing one of the larger partitions of the first 
and second relations with at least two smaller parti- 
tions, thereby defining replacement jobs for the job 25 
defined by the replaced partitions, each such re- 
placement job consisting of a task of joining one of 
the smaller replacing partitions of the first relation 
with the corresponding smaller replacing partition 
of the second relation; 30 

(g) updating a current set of execution time estimates 
by replacing the execution time estimates of the 
replaced jobs with execution time estimates for the 
replacement jobs; 

(h) updating the scheduling of the current set of jobs 35 
among P processors so as to minimize the estimated 
completion time skew; 

(i) estimating the amount of completion time skew 
which would result if P processors were to per- 
form the current set of jobs as scheduled in step (h); 40 

(j) comparing the amount of completion time skew 
estimated in step (i) with the skew standard; and 

(k) iteratively performing steps (e) through (j) until 
either the amount of skew meets the skew standard 
in step (j) or all partitions have been subpartitioned 45 
to a predetermined maximum extent. 

3. The method recited in claim 2 wherein the step of 
partitioning further comprises the steps of: 

(1) estimating the time it will take for a single proces- 
sor to accomplish each of said jobs, said estimates 50 
forming a current set of execution time estimates; 

(m) subpartitioning any partitions which correspond 
to jobs having an execution time estimate which is 
greater than the sum of the execution time esti- 
mates divided by P; 55 

(n) updating the current set of execution time esti- 
mates by replacing the execution time estimates of 
the replaced jobs with execution time estimates for 
the replacement jobs; and 

(o) iteratively performing the steps (m) and (n) until 60 
there is no execution time estimate for any job in 
the current set of jobs which is greater than the 
sum of the execution time estimates divided by P. 

4. The method recited in claim 3 wherein the step of 
subpartitioning further comprises the steps of: 65 

(p) replacing any such partition of the first relation 
and corresponding partition of the second relation 
which include more than one distinct value in the 
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common field with at least two smaller partitions, 
each individual value in the common field corre- 
sponding uniquely to one of the smaller partitions 
of the first relation and uniquely to the correspond- 
ing one of the smaller partitions of the second rela- 
tion, thereby defining replacement jobs for any step 
(p) jobs in the current set of jobs, each such re- 
placement job consisting of the task of joining one 
of such smaller partitions of the first relation with 
the corresponding smaller partition of the second 
relation; and 

(q) with respect to any such job where the corre- 
sponding partitions of the first and second relations 
include only one distinct value in the common 
field, replacing the larger of such corresponding 
partitions with X smaller partitions, where X is the 
minimum integer which when divided into the 
execution time estimate for such job results in a 
time less than the sum of the execution time esti- 
mates divided by P, thereby defining replacement 
jobs for any such step (q) jobs in the current set of 
jobs, each such replacement job consisting of the 
task of joining one of the X smaller partitions of the 
one relation with the unchanged corresponding 
partition of the other relation. 

5. The method recited in claim 2 wherein Vj and V2 
are two values in a domain of join columns of said first 
and second relations and a pair (Vi,V2) satisfying 
Vj < V2 is defined as a type 1 pair and a pair (V i,V2> 
satisfying Vj = V2is defined as a type 2 pair, and if the 
estimated skew does not meet the skew standard, fur- 
ther comprising prior to step (f) the steps of: 

(r) selecting the largest type 1 pair for replacing with 
at least two smaller partitions; but 

(s) if there are no type 1 pairs, then executing the 
current set of jobs as last scheduled using P proces- 
sors, thereby joining said two database relations on 
a common field with minimum completion time 
skew. 

6. The method recited in claim 3 wherein V| and V2 
are two values in a domain of join columns of said first 
and second relations and a pair (Vi,V 2 ) satisfying 
V]<V2 is defined as a type 1 pair and a pair (Vi,V 2 ) 
satisfying Vj=V2 is defined as a type 2 pair, and 
wherein when a time estimate for one of said jobs is 
greater than 1/P of a total execution time in step (m) 
further comprising the steps of: 

(t) determining if said one job is a type 1 pair; and 
(u) if a type 1 pair, then replacing it with at least two 
partitions, one of which is of type 2. 

7. The method recited in claim 6 wherein if said one 
job is a type 2 pair, then performing the following steps: 

(v) determining if the multiplicity of the type 2 pair 

can be increased; and 
(w) if the multiplicity can be increased, finding the 

smallest multiplicity and revising the current set of 

jobs. 

8. The method recited in claim 7 wherein if the multi- 
plicity of the type 2 pair cannot be increased, then per- 
forming the steps of: 

(x) assigning the type 2 pair to have a multiplicity P; 
and 

(y) assigning each of the P jobs to distinct processors. 

9. The method recited in claim 1 wherein step (a) is 
performed by a hierarchical double hashing technique 
using two hash functions Hj and H2 where H2 is used to 
further divide each hash partition created by Hi into 
finer partitions. 
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10. The method recited in claim 1 wherein during 
step (e) monitoring the progress of said processors in 
the join operation, and if the progress of the join opera- 
tion becomes unbalanced exceeding a predetermine 
threshold, repeating steps (b) to (d). 

11. A method of joining first and second database 
relations on a common field in a parallel relational data- 
base environment having P processors comprising a 
database machine, comprising the database machine 
performed steps of: 

(a) partitioning the first relation into a multiplicity of 
partitions and the second relation into a corre- 
sponding multiplicity of partitions such that each 
individual value in the common field corresponds 
uniquely to one of the partitions of the first relation 
and uniquely to the corresponding one of the parti- 
tions of the second relation to define a current set 
of jobs, each consisting of the task of joining a 
partition of the first relation with the correspond- 
ing partition of the second relation; 

(b) estimating the time it will take for a single proces- 
sor to accomplish each of said jobs, said estimates 
forming a current set of execution time estimates; 

(c) subpartitioning any partitions which correspond 

to jobs having an execution time estimate which is 25 
greater than the sum of the execution time esti- 
mates divided by P, by: 

(1) replacing any such partition of the first relation 
and corresponding partition of the second rela- 
tion which include more than one distinct value 30 
in the common field with at least two smaller 
partitions, each individual value in the common 
field corresponding uniquely to one of the 
smaller partitions of the first relation and 
uniquely to the corresponding one of the smaller 35 
partitions of the second relation to define re- 
placement jobs for any step (c)(1) jobs in the 
current set of jobs, each such replacement job 
consisting of the task of joining one of such 
smaller partitions of the first relation with the 40 
corresponding smaller partition of the second 
relation; and 

(2) with respect to any such job where the corre- 
sponding partitions of the first and second rela- 
tions include only one distinct value in the com- 45 
mon field, replacing the larger of such corre- 
sponding partitions with X smaller partitions, 
where X is the minimum integer which results in 

a time less than the sum of the execution time 
estimates divided by P, thereby defining replace- 50 
ment jobs for any such step (c)(2) jobs in the 
current set of jobs, each such replacement job 
consisting of the task of joining one of the X 
smaller partitions of one relation with an un- 
changed corresponding partition of another rela- 
tion; 

(d) updating the current set of execution time esti- 
mates by replacing the execution time estimates of 
the replaced jobs with execution time estimates for 
the replacement jobs; 

(e) iteratively performing the steps (c) and (d) until 
there is no execution time estimate for any job in 
the current set of jobs which is greater than the 
sum of the execution time estimates divided by P; 

(0 scheduling the current set of jobs among P proces- 
sors using a minimum makespan optimization tech- 
nique that minimizes a maximum running time on a 
longest running processor; 
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(g) estimating an amount of completion time skew 
which would result if P processors were to per- 
form the current set of jobs as scheduled in step (0; 

(h) comparing the estimated amount of skew with a 
skew standard; 

(i) if the estimated skew meets the skew standard, 
skipping steps (j) through (o); 

(j) if the estimated skew does not meet the skew stan- 
dard, replacing a larger partition of the first and 
second relations with at least two smaller parti- 
tions, but not replacing any partitions formed 
through performing step (c)(2), to define replace- 
ment jobs for the job defined by the replaced parti- 
tions, each such replacement job consisting of a 
task of joining one of the smaller replacing parti- 
tions of the first relation with a corresponding 
smaller replacing partition of the second relation; 

(k) updating the current set of execution time esti- 
mates by replacing the execution time estimates for 
the replacement jobs; 

(1) updating scheduling of the current set of jobs 
among P processors so as to minimize estimated 
completion time skew; 

(m) estimating an amount of completion time skew 
which would result if P processors were to per- 
form the current set of jobs as scheduled in step (1); 

(n) comparing the amount of skew estimates in step 
(m) with the skew standard; 

(o) iteratively performing steps (j) through (n) until 
either the amount of skew meets the skew standard 
in step (n) or all partitions have been subpartitioned 
to a predetermined maximum extent; and 

(p) executing the current set of jobs as last scheduled 
using said P processors to join said two database 
relations on a common field with minimum com- 
pletion time skew. 

12. Apparatus in a database machine for joining first 
and second database relations on a common field in a 
parallel relational database environment having P pro- 
cessors linked by an interconnect network, said appara- 
tus comprising: 

(a) means for partitioning the first relation into a 
multiplicity of partitions and the second relation 
into a corresponding multiplicity of partitions such 
that each individual value in the common field 
corresponds uniquely to one of the partitions of the 
first relation and uniquely to the corresponding one 
of the partitions of the second relation to define a 
current set of jobs, each consisting of a task of 
joining a partition of the first relation with the 
corresponding partition of the second relation; 

(b) means for scheduling a current set of jobs among 
P processors using a minimum makespan optimiza- 
tion technique that schedules said jobs such that a 
maximum running time on any processor is mini- 
mized; 

(c) means for estimating an amount of completion 
time skew which would result if said P processors 
were to perform the current set of jobs as sched- 
uled by said means for scheduling; 

(d) means for comparing the estimated amount of 
skew with a skew standard; and 

(e) means for executing the current set of jobs as last 
scheduled using P processors to join said first and 
second database relations on a common field with 
minimum completion time skew if the estimated 
skew meets the skew standard as determined by 
said means for comparing. 
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13. The apparatus in said database machine recited in 
claim 12 wherein the partitioning by said means for 
. iteratively partitioning results in larger and smaller 
partitions of the first and second relations, further com- 
prising: 

(0 means for replacing one of the larger partitions of 
the first and second relations with at least two 
smaller partitions if the estimated skew does not 
meet the skew standard, thereby defining replace- 
ment jobs for the job defined by the replaced parti- 10 
tions, each such replacement job consisting of a 
task of joining one of the smaller replacing parti- 
tions of the first relation with the corresponding 
smaller replacing partition of the second relation; 

(g) first means for updating a current set of execution 15 
time estimates by replacing the execution time 
estimates of the replaced jobs with execution time 
estimates for the replacement jobs; 

(h) second means for updating the scheduling of the 
current set of jobs among P processors so as to 20 
minimize the estimated completion time skew; 

(i) second means for estimating the amount of com- 
pletion time skew which would result if P proces- 
sors were to perform the current set of jobs as 
scheduled by said second means for updating; 25 

(j) means for comparing the amount of completion 
time skew estimated by said second means for up- 
dating with the skew standard; and 
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(k) said means for replacing, first and second means 
for updating, second means for estimating, and 
means for comparing iteratively operating until 
either the amount of skew meets the skew standard 
or all partitions have been subpartitioned to a pre- 
determined maximum extent. 
14. The apparatus in said database machine recited in 
claim 13 said means for iteratively partitioning further 
comprises: 

(1) third means for estimating a time it will take for a 
single processor to accomplish each of said jobs, 
said, estimates forming a current set of execution 
time estimates; 
(m) means for subpartitioning any partitions which 
correspond to jobs having an execution time esti- 
mate which is greater than the sum of the execution 
time estimates divided by P; and 
(n) third means for updating the current set of execu- 
tion time estimates by replacing the execution time 
estimates of the replaced jobs with execution time 
estimates for the replacement jobs; 
(o) said third means for estimating, means for subparr 
titioning and third means for updating iteratively 
operating until there is no execution time estimate 
for any job in the current set of jobs which is 
greater than the sum of the execution time esti- 
mates divided by P. 

***** 
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